An Approximation Ratio for Biclustering



Kai Puolamäki
Sami Hanhijärvi
Gemma C. Garriga

Helsinki Institute for Information Technology HUT 
Helsinki University of Technology 
P.O. Box 5400 
FI-02015 TKK 
Finland 

22 August 2008 



Abstract 

The problem of biclustering consists of the simultaneous clustering 
of rows and columns of a matrix such that each of the submatrices 
induced by a pair of row and column clusters is as uniform as possible. 
In this paper we approximate the optimal biclustering by applying 
one-way clustering algorithms independently on the rows and on the 
columns of the input matrix. We show that such a solution yields a
worst-case approximation ratio of $1 + \sqrt{2}$ under L1-norm for 0-1 valued
matrices, and of 2 under L2-norm for real valued matrices.

Keywords: Approximation algorithms; Biclustering; One-way clustering 

1 Introduction 

The standard clustering problem [8] consists of partitioning a set of input
vectors, such that the vectors in each partition (cluster) are close to one
another according to some predefined distance function. This formulation
is the objective of the popular K-means algorithm (see, for example, [9]),

Figure 1: (a) An example binary data matrix X of dimensions 4 x 6,
with rows and columns labeled with numbers and characters. (b) The
optimal biclustering of X consists of $\{R_1^*, R_2^*\} = \{\{1\}, \{2, 3, 4\}\}$ row
clusters and $\{C_1^*, C_2^*, C_3^*\} = \{\{b, f\}, \{a, d, e\}, \{c\}\}$ column clusters when
using L1-norm. (c) Biclusters of the data matrix returned by our scheme,
that is, using an optimal one-way clustering algorithm twice, once on the 4
row vectors and once on the 6 column vectors, with L1-norm. The resulting
clusterings are $\{R_1, R_2\} = \{\{1, 3, 4\}, \{2\}\}$ for rows and
$\{C_1, C_2, C_3\} = \{\{b, f\}, \{a, e\}, \{c, d\}\}$ for columns. For visual clarity,
the rows and columns of the original matrix in (a) have been permuted in
(b) and (c) by making the rows (and columns) of a single cluster adjacent.



where K denotes the final number of clusters and the distance function is
defined by the L2-norm. Another similar example of this formulation is the
K-median algorithm (see, for example, [3]), where the distance function is
given by the L1-norm. Clustering a set of input vectors is a well-known
NP-hard problem even for K = 2 clusters [1]. Several approximation guarantees
have been shown for this formulation of the standard clustering problem 
(see [3 El 12] and references therein) . 

Intensive recent research has focused on the discovery of homogeneous 
substructures in large matrices. This is also one of the goals in the problem 
of biclustering. Given a set of N rows in M columns from a matrix X, a 
biclustering algorithm identifies subsets of rows exhibiting similar behavior 
across a subset of columns, or vice versa. Note that the optimal solution
for this problem necessarily requires clustering the N vectors and the M
dimensions simultaneously, hence the name biclustering. Each submatrix of
X, induced by a pair of row and column clusters, is typically referred to
as a bicluster. See Figure 1 for a simple toy example. The main challenge
of a biclustering algorithm lies in the dependency between the row and 
column partitions, which makes it difficult to identify the optimal biclusters. 
A change in a row clustering affects the cost of the induced submatrices 
(biclusters), and as a consequence, the column clustering may also need to 
be changed to improve the solution. 

Finding an optimal solution for the biclustering problem is NP-hard.
This observation follows directly from the reduction of the standard
clustering problem (known to be NP-hard) to the biclustering problem by fixing
the number of clusters in columns to M. To the best of our knowledge, no
algorithm exists that can efficiently approximate biclustering with a proven
approximation ratio. The goal of this paper is to propose such an
approximation guarantee by means of a very simple scheme.

Our approach consists of relaxing the requirement for simultaneous
clustering of rows and columns and instead performing the two clusterings
independently. In other words, our final biclusters correspond to the
submatrices of X induced by pairs of row and column clusters, found
independently with a standard clustering algorithm. We sometimes refer to
this standard clustering algorithm as one-way clustering. The simplicity of
the solution frees us from the inconvenient dependency between rows and
columns. More importantly, the solution obtained with this approach, despite
not being optimal, allows for the study of approximation guarantees on the
obtained biclusters. Here we prove that our solution achieves a worst-case
approximation ratio of $1 + \sqrt{2}$ under L1-norm for 0-1 valued matrices,
and of 2 under L2-norm for real valued matrices.

Finally, note that our final solution is constructed on top of a standard
clustering algorithm (applied twice, once to the row vectors and once to the
column vectors) and therefore, it is necessary to multiply our ratio by
the approximation ratio achieved by the standard clustering algorithm used
(such as IS] ) • For clarity, we will lift this restriction in the following proofs 
by assuming that the applied one-way clustering algorithm provides directly 
an optimal solution to the standard clustering problem. 

1.1 Related work 

This basic algorithmic problem and several variations were initially pre- 
sented in [5] with the name of direct clustering. The same problem and its 
variations have also been referred to as two-way clustering, co-clustering or 
subspace clustering. In practice, finding highly homogeneous biclusters has 
important applications in biological data analysis (see [10] for a review and
references), where a bicluster may, for example, correspond to an activa- 
tion pattern common to a group of genes only under specific experimental 
conditions. 

An alternative definition of the basic biclustering problem described in 
the introduction consists of finding the maximal bicluster in a given matrix.
A well-known connection of this alternative formulation is its reduction to
the problem of finding a biclique in a bipartite graph [7]. Algorithms for
detecting bicliques enumerate them in the graph by using the monotonicity 
property that a subset of a biclique is also a biclique (TJ |5] . These algorithms 
usually have a high order of complexity. 

2 Definitions



We assume that we are given a matrix X of size $N \times M$, and integers
$K_r$ and $K_c$, which define the number of clusters partitioning rows and
columns, respectively. The goal is to approximate the optimal biclustering
of X by means of a one-way row clustering into $K_r$ clusters and a one-way
column clustering into $K_c$ clusters.

For any $T \in \mathbb{N}$ we denote $[T] = \{1, \ldots, T\}$. We use $X(R, C)$, where
$R \subseteq [N]$ and $C \subseteq [M]$, to denote the submatrix of X induced by the subset
of rows R and the subset of columns C. Let Y denote an induced submatrix
of X, that is, $Y = X(R, C)$ for some $R \subseteq [N]$ and $C \subseteq [M]$. When required
by the context, we will also refer to $Y = X(R, C)$ as a bicluster of X and
denote the size of Y by $n \times m$, where $n \le N$ and $m \le M$. We use
median(Y) and mean(Y) to denote the median and mean of all elements of
Y, respectively.

The scheme for approximating the optimal biclustering is defined as follows.

Input: matrix $X$, number of row clusters $K_r$, number of column clusters $K_c$

$\mathcal{R}$ = kcluster($X$, $K_r$)
$\mathcal{C}$ = kcluster($X^T$, $K_c$)

Output: a set of biclusters $X(R, C)$, for each $R \in \mathcal{R}$, $C \in \mathcal{C}$



The function kcluster($X$, $K_r$) denotes here an optimal one-way clustering
algorithm that partitions the row vectors of matrix X into $K_r$ clusters. We
have used $X^T$ to denote the transpose of matrix X.
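
The scheme above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the names `l1_cost`, `transpose` and `bicluster` are hypothetical, `kcluster` is realised here by exhaustive search under the L1-norm (exponential in the number of rows, so usable only on toy matrices), and it may return fewer than k clusters when some clusters would be empty.

```python
from itertools import product
from statistics import median


def l1_cost(values):
    """L1 dissimilarity: sum of absolute deviations from the median."""
    values = list(values)
    if not values:
        return 0.0
    med = median(values)
    return sum(abs(v - med) for v in values)


def kcluster(X, k):
    """Brute-force optimal one-way clustering of the rows of X (assumed helper).

    Tries every assignment of rows to k labels and keeps the cheapest one.
    Empty clusters are dropped from the returned partition.
    """
    n, m = len(X), len(X[0])
    best_cost, best_labels = float("inf"), None
    for labels in product(range(k), repeat=n):
        cost = 0.0
        for c in range(k):
            rows = [i for i in range(n) if labels[i] == c]
            # Within a row cluster, every column is scored against its own median.
            cost += sum(l1_cost(X[i][j] for i in rows) for j in range(m))
        if cost < best_cost:
            best_cost, best_labels = cost, labels
    clusters = [[i for i in range(n) if best_labels[i] == c] for c in range(k)]
    return [c for c in clusters if c]


def transpose(X):
    return [list(col) for col in zip(*X)]


def bicluster(X, k_r, k_c):
    """Cluster rows and columns independently and pair up the clusters."""
    row_clusters = kcluster(X, k_r)
    col_clusters = kcluster(transpose(X), k_c)
    return [(R, C) for R in row_clusters for C in col_clusters]
```

Each returned pair `(R, C)` holds zero-based row and column indices of one bicluster X(R, C). In practice the exhaustive search would be replaced by an approximate one-way clustering algorithm, at the price of the extra factor discussed in the introduction.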

Instead of fixing a specific norm for the formulas, we use the dissimilarity
measure $V()$ to absorb the norm-dependent part. For L1-norm, $V()$ would
be defined as $V(Y) = \sum_{y \in Y} |y - \mathrm{median}(Y)|$, and for L2-norm as
$V(Y) = \sum_{y \in Y} (y - \mathrm{mean}(Y))^2$. Given Y of size $n \times m$, we further use
a special row norm, $V_R(Y) = \sum_{j=1}^{m} V(Y([n], j))$, and a special column norm,
$V_C(Y) = \sum_{i=1}^{n} V(Y(i, [m]))$.
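
As a small illustration of these definitions, the following Python sketch computes the dissimilarity of a block and its row and column norms. The function names (`v_l1`, `v_l2`, `v_rows`, `v_cols`) are assumptions made for this example only; a block is represented as a list of rows.

```python
from statistics import mean, median


def v_l1(block):
    """V(Y) under L1-norm: absolute deviations from the median of all entries."""
    values = [v for row in block for v in row]
    med = median(values)
    return sum(abs(v - med) for v in values)


def v_l2(block):
    """V(Y) under L2-norm: squared deviations from the mean of all entries."""
    values = [v for row in block for v in row]
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values)


def v_rows(block, v):
    """Row norm V_R(Y): apply v to each column Y([n], j) and sum over j."""
    return sum(v([[x] for x in col]) for col in zip(*block))


def v_cols(block, v):
    """Column norm V_C(Y): apply v to each row Y(i, [m]) and sum over i."""
    return sum(v([row]) for row in block)


Y = [[1, 0],
     [1, 1]]
cost_l1 = v_l1(Y)              # one entry differs from the median
row_norm = v_rows(Y, v_l1)     # each column scored against its own median
col_norm = v_cols(Y, v_l1)     # each row scored against its own median
```

Passing `v_l2` instead of `v_l1` to `v_rows` and `v_cols` gives the corresponding L2 quantities.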

We define the one-way row clustering, given by kcluster above, as a
partition of rows $[N]$ into $K_r$ clusters $\mathcal{R} = \{R_1, \ldots, R_{K_r}\}$ such that the cost
function

$$L_R = \sum_{R \in \mathcal{R}} \sum_{j=1}^{M} V(X(R, j)) \qquad (1)$$

is minimized. Analogously, the one-way clustering of columns $[M]$ into $K_c$
clusters $\mathcal{C} = \{C_1, \ldots, C_{K_c}\}$ is defined such that the cost function

$$L_C = \sum_{i=1}^{N} \sum_{C \in \mathcal{C}} V(X(i, C)) \qquad (2)$$

is minimized.

The cost of biclustering, induced by the two one-way clusterings above, is

$$L = \sum_{R \in \mathcal{R}} \sum_{C \in \mathcal{C}} V(X(R, C)). \qquad (3)$$

Notice that we are assuming that the one-way clusterings above, denoted
$\mathcal{R}$ on rows and $\mathcal{C}$ on columns, correspond to optimal one-way partitionings
on rows and columns, respectively.

Finally, the optimal biclustering on X is given by simultaneous row and
column partitions $\mathcal{R}^* = \{R_1^*, \ldots, R_{K_r}^*\}$ and $\mathcal{C}^* = \{C_1^*, \ldots, C_{K_c}^*\}$ that
minimize the cost

$$L^* = \sum_{R^* \in \mathcal{R}^*} \sum_{C^* \in \mathcal{C}^*} V(X(R^*, C^*)). \qquad (4)$$

3 Approximation ratio 

Given the definitions above, our main result reads as follows. 

Theorem 1. There exists an approximation ratio of $\alpha$ such that $L \le \alpha L^*$,
where $\alpha = 1 + \sqrt{2} \approx 2.41$ for L1-norm and $X \in \{0, 1\}^{N \times M}$, and $\alpha = 2$ for
L2-norm and $X \in \mathbb{R}^{N \times M}$.

We use the following intermediate result to prove the theorem. 

Lemma 2. There exists an approximation ratio of at most $\alpha$, that is,
$L \le \alpha L^*$, if for any X and for any partitionings $\mathcal{R}$ and $\mathcal{C}$ of X, all biclusters
$Y = X(R, C)$, with $R \in \mathcal{R}$ and $C \in \mathcal{C}$, satisfy

$$V(Y) \le \frac{1}{2} \alpha \left( V_R(Y) + V_C(Y) \right). \qquad (5)$$

Proof. First we note that the cost of the optimal biclustering $L^*$ cannot
increase when we increase the number of row (or column) clusters. For
example, consider the special case where $K_r = N$ (or $K_c = M$). In such a
case, each row (or column) is assigned to its own cluster and the cost of
the optimal biclustering equals the cost of the optimal one-way clustering
on columns $L_C$ (or rows $L_R$). Hence, the optimal biclustering solution is
bounded from below by

$$L^* \ge \max(L_R, L_C) \ge \frac{1}{2} (L_R + L_C). \qquad (6)$$
Summing both sides of Equation (5) over all biclusters,

$$\sum_{R \in \mathcal{R}} \sum_{C \in \mathcal{C}} V(Y) \Big|_{Y = X(R, C)} \le \frac{1}{2} \alpha \sum_{R \in \mathcal{R}} \sum_{C \in \mathcal{C}} \left( V_R(Y) + V_C(Y) \right) \Big|_{Y = X(R, C)},$$

and using Equations (1), (2) and (3), gives $L \le \frac{1}{2} \alpha (L_R + L_C)$, which
together with Equation (6) implies the approximation ratio of $L \le \alpha L^*$. □
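
The chain of inequalities in this proof can be checked numerically on tiny inputs. The sketch below is an illustration only, not part of the proof: it assumes the L1-norm, 0-1 valued matrices and exhaustive search for all optimal clusterings, and every function name in it is hypothetical. It verifies the lower bound of Equation (6) and the bound of Theorem 1 for the L1 case on every 3 x 3 binary matrix with two row and two column clusters.

```python
from itertools import product
from math import sqrt
from statistics import median


def V(values):
    """L1 dissimilarity: sum of absolute deviations from the median."""
    values = list(values)
    if not values:
        return 0.0
    med = median(values)
    return sum(abs(v - med) for v in values)


def partitions(size, k):
    """All ways to split range(size) into at most k groups (empty groups dropped)."""
    result = []
    for labels in product(range(k), repeat=size):
        groups = [[i for i in range(size) if labels[i] == c] for c in range(k)]
        result.append([g for g in groups if g])
    return result


def cost(X, row_groups, col_groups):
    """Biclustering cost: sum of V over all induced submatrices."""
    return sum(V(X[i][j] for i in R for j in C)
               for R in row_groups for C in col_groups)


def costs(X, k_r, k_c):
    """Return (L_R, L_C, L, L_star) for matrix X by exhaustive search."""
    n, m = len(X), len(X[0])
    single_rows = [[i] for i in range(n)]
    single_cols = [[j] for j in range(m)]
    row_parts, col_parts = partitions(n, k_r), partitions(m, k_c)
    # One-way costs (1) and (2) are biclustering costs with singleton columns (rows).
    rows = min(row_parts, key=lambda R: cost(X, R, single_cols))
    cols = min(col_parts, key=lambda C: cost(X, single_rows, C))
    L_R = cost(X, rows, single_cols)
    L_C = cost(X, single_rows, cols)
    L = cost(X, rows, cols)
    L_star = min(cost(X, R, C) for R in row_parts for C in col_parts)
    return L_R, L_C, L, L_star


def check_all_3x3(k_r=2, k_c=2):
    """Check L* >= max(L_R, L_C) and L <= (1 + sqrt(2)) L* on all 3x3 binary matrices."""
    alpha, eps = 1 + sqrt(2), 1e-9
    for bits in product((0, 1), repeat=9):
        X = [list(bits[0:3]), list(bits[3:6]), list(bits[6:9])]
        L_R, L_C, L, L_star = costs(X, k_r, k_c)
        if L_star + eps < max(L_R, L_C) or L > alpha * L_star + eps:
            return False
    return True
```

Calling `check_all_3x3()` enumerates 512 matrices; larger sizes quickly become infeasible because the search is exponential in both dimensions.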



Theorem 1 is proven separately in Sections 3.1 and 3.2 using Lemma 2.
Section 3.1 deals with the case of a 0-1 valued matrix X and the
L1-norm distance function, while Section 3.2 deals with a real valued matrix
X and the L2-norm.



3.1 L1-norm and 0-1 valued matrix

Figure 2: Examples of swaps performed within bicluster Y for the technical
part of the proof in Section 3.1. For clarity, the rows and columns of the
bicluster Y have been ordered such that the blocks A, B, C and D are
contiguous.



Consider a 0-1 valued matrix X and L1-norm. To prove Theorem 1
it suffices to show that Equation (5) holds for each of the biclusters
$Y = X(R, C)$ of X, where $R \in \mathcal{R}$ and $C \in \mathcal{C}$. Therefore, in the following we
concentrate on one single bicluster $Y \in \{0, 1\}^{n \times m}$.

Without loss of generality, we consider only the case where the bicluster
Y has at least as many 0's as 1's. In such a case, the median of Y can be safely
taken to be zero and the cost $V(Y) \le \frac{1}{2} nm$ is then fixed to the number of
1's in the matrix. To get the worst case scenario towards the tightest upper
bound on $\alpha$ in Equation (5), we should first find a configuration of 1's such
that, given $V(Y)$, the sum $V_R(Y) + V_C(Y)$ is minimized.

Denote by $O_R$ and $O_C$ the sets of rows and columns in Y which have more
1's than 0's, respectively. Denote $A = Y(O_R, O_C)$, $B = Y(O_R, [m] \setminus O_C)$,
$C = Y([n] \setminus O_R, O_C)$, $D = Y([n] \setminus O_R, [m] \setminus O_C)$, $n' = |O_R|$ and $m' = |O_C|$.
Note that A, B, C and D are simply blocks of bicluster Y, which we need
to make explicit in our notation for the proof.

Changing a 0 to 1 in A or a 1 to 0 in D decreases $V_R(Y) + V_C(Y)$ by
two, while changing a 0 to 1 or 1 to 0 in B or C changes $V_R(Y) + V_C(Y)$
by at most one. It follows that swapping a 1 in B or C with a 0 in A (see
Figure 2), or swapping a 1 in D with a 0 in A, B or C (see Figure 2),
decreases $V_R(Y) + V_C(Y)$ while $V(Y)$ remains unchanged. In other words,
in a solution that minimizes $V_R(Y) + V_C(Y)$ no such swaps can be made.
In the remainder of this subsection, we assume that the bicluster Y satisfies
this property.

It follows that (i) A, B and C are blocks of 1's, (ii) A is a block of
1's and D is a block of 0's, or (iii) B, C and D are blocks of 0's. Denote
by $o()$ the number of 1's in a given block. It follows that
$V(Y) = o(A) + o(B) + o(C) + o(D) \le \frac{1}{2} nm$,
$V_R(Y) = nm' - o(A) + o(B) - o(C) + o(D)$ and
$V_C(Y) = n'm - o(A) - o(B) + o(C) + o(D)$. We denote $x = n'/n$, $y = m'/m$,
$a = o(A)/(nm)$, $b = o(B)/(nm)$, $c = o(C)/(nm)$ and $d = o(D)/(nm)$ and
rewrite Equation (5) as

$$\alpha = 2 \sup \frac{a + b + c + d}{x + y - 2a + 2d}$$

with constraints $a + b + c + d \in [0, \frac{1}{2}]$, $x \in [0, 1]$, $y \in [0, 1]$, as well as
(i) $a = xy$, $b = x(1 - y)$, $c = (1 - x)y$ and $d \in [0, (1 - x)(1 - y)]$; (ii)
$a = xy$, $b \in [0, x(1 - y)]$, $c \in [0, (1 - x)y]$ and $d = 0$; or (iii) $a \in [0, xy]$
and $b = c = d = 0$. The optimization problem has two solutions, (i)
$x = y = 1 - 1/\sqrt{2}$, $a = xy$, $b = x(1 - y)$, $c = (1 - x)y$ and $d = 0$, and
(ii) $x = y = 1/\sqrt{2}$, $a = xy$ and $b = c = d = 0$, both solutions yielding
$\alpha = 1 + \sqrt{2}$ when exactly half of the entries in the bicluster Y are 1's. This
proves Theorem 1 for 0-1 valued matrices and L1-norm.
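
The value of the supremum can be sanity-checked numerically. The following Python sketch is an illustration under stated assumptions, not a replacement for the analysis: it evaluates the ratio $2(a + b + c + d)/(x + y - 2a + 2d)$ at the two maximisers (with the closed forms $x = y = 1 - 1/\sqrt{2}$ and $x = y = 1/\sqrt{2}$ obtained by solving the constraints at equality) and on a coarse grid of feasible points for the three cases. The function names and the grid resolution are arbitrary choices for this example.

```python
from math import sqrt


def alpha(x, y, a, b, c, d):
    """Ratio 2 (a + b + c + d) / (x + y - 2a + 2d); None where it is undefined."""
    den = x + y - 2 * a + 2 * d
    return None if den <= 1e-12 else 2 * (a + b + c + d) / den


def feasible_points(steps=40, inner=10):
    """Yield grid points (x, y, a, b, c, d) for the structural cases (i)-(iii)."""
    for ix in range(steps + 1):
        for iy in range(steps + 1):
            x, y = ix / steps, iy / steps
            full = (x * y, x * (1 - y), (1 - x) * y, (1 - x) * (1 - y))
            for s in range(inner + 1):
                t = s / inner
                # Case (i): A, B and C full of 1's, D partially filled.
                yield x, y, full[0], full[1], full[2], t * full[3]
                # Case (iii): only A partially filled.
                yield x, y, t * full[0], 0.0, 0.0, 0.0
                for s2 in range(inner + 1):
                    u = s2 / inner
                    # Case (ii): A full, B and C partially filled, D empty.
                    yield x, y, full[0], t * full[1], u * full[2], 0.0


def grid_supremum():
    """Largest ratio found on the grid among points with at most half 1's."""
    best = 0.0
    for x, y, a, b, c, d in feasible_points():
        if a + b + c + d > 0.5 + 1e-12:
            continue  # Y must have at least as many 0's as 1's
        value = alpha(x, y, a, b, c, d)
        if value is not None:
            best = max(best, value)
    return best


r = 1 / sqrt(2)
# Solution (i): A, B and C are full and D is empty.
first = alpha(1 - r, 1 - r, (1 - r) ** 2, (1 - r) * r, r * (1 - r), 0.0)
# Solution (ii): only A is full.
second = alpha(r, r, 0.5, 0.0, 0.0, 0.0)
```

Both `first` and `second` evaluate to $1 + \sqrt{2}$ up to floating-point error, and `grid_supremum()` stays below that value because the grid does not contain the exact maximisers.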

Notice that the above proof relies on the fact that the input matrix X 
has only two types of values. Therefore, the proof does not generalize to 
real valued matrices. 

An example of a matrix with approximation ratio of 2 is given by a
$4 \times (4q - 1)$ matrix



X 



with q columns in the first column group, q columns in the second column
group and $2q - 1$ columns in the third column group, clustered to two row
clusters, $K_r = 2$, and one column cluster, $K_c = 1$, at the limit of large
q. The optimal one-way clustering of rows is given by $\mathcal{R} = \{\{1, 2\}, \{3, 4\}\}$,
$L = 8q - 2$, and the optimal biclustering of rows by $\mathcal{R}^* = \{\{1, 3\}, \{2, 4\}\}$,
$L^* = 4q$, so that $L / L^* = (8q - 2)/(4q) \to 2$ as $q \to \infty$.

3.2 L2-norm and real valued matrix 

Consider now a real valued matrix X and L2-norm. We want to prove
Theorem 1 for the real valued biclusters Y of X. To find the approximation
ratio, it suffices to show that Equation (5) holds for each bicluster $Y \in \mathbb{R}^{n \times m}$,
which are determined by $Y = X(R, C)$, where $R \in \mathcal{R}$ and $C \in \mathcal{C}$.

Using the definitions of $V(Y)$, $V_R(Y)$ and $V_C(Y)$, we can write
$V(Y) = V_R(Y) + V_C(Y) - \sum_{i=1}^{n} \sum_{j=1}^{m} (Y(i, j) - \hat{Y}(i, j))^2 \le V_R(Y) + V_C(Y)$, where
$\hat{Y}(i, j) = \mathrm{mean}(Y([n], j)) + \mathrm{mean}(Y(i, [m])) - \mathrm{mean}(Y)$. Hence, Equation
(5) is satisfied for L2-norm and real valued matrices when $\alpha = 2$.
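
The decomposition used in this argument can be verified numerically on any real valued block. The sketch below is a minimal check, assuming a block given as a list of rows; the helper names (`mean`, `v_l2`, `l2_decomposition`) are hypothetical and chosen for this example.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def v_l2(values):
    """L2 dissimilarity: sum of squared deviations from the mean."""
    values = list(values)
    mu = mean(values)
    return sum((v - mu) ** 2 for v in values)


def l2_decomposition(Y):
    """Return (V, V_R, V_C, residual) for a real valued block Y (list of rows)."""
    n, m = len(Y), len(Y[0])
    col_means = [mean(Y[i][j] for i in range(n)) for j in range(m)]
    row_means = [mean(Y[i]) for i in range(n)]
    mu = mean(v for row in Y for v in row)
    V = v_l2(v for row in Y for v in row)
    # V_R sums the cost of each column Y([n], j); V_C sums the cost of each row.
    V_R = sum(v_l2(Y[i][j] for i in range(n)) for j in range(m))
    V_C = sum(v_l2(row) for row in Y)
    # Residual: squared distance between Y(i, j) and the additive fit
    # column mean + row mean - overall mean.
    residual = sum((Y[i][j] - (col_means[j] + row_means[i] - mu)) ** 2
                   for i in range(n) for j in range(m))
    return V, V_R, V_C, residual


Y = [[1.0, 2.0, 4.0],
     [0.5, -1.0, 3.0]]
V, V_R, V_C, residual = l2_decomposition(Y)
identity_gap = abs(V - (V_R + V_C - residual))
```

Since the residual is a sum of squares, it is non-negative, which is exactly why $V(Y) \le V_R(Y) + V_C(Y)$ follows from the identity.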

4 Conclusions 

We have shown that approximating the optimal biclustering with independent
row- and column-wise standard clusterings achieves a constant-factor
approximation guarantee: $1 + \sqrt{2}$ under L1-norm for 0-1 valued matrices
and 2 under L2-norm for real valued matrices. However, in practice, standard
one-way clustering algorithms (such as K-means or K-median) are also
approximate, and therefore it is necessary to multiply our ratio by the
approximation ratio achieved by the standard clustering algorithm (such as
those presented in [3, 9]) to obtain the true approximation ratio of our
scheme. Still, our contribution shows that in many practical applications of
biclustering, it may be sufficient to use a more straightforward standard
clustering of rows and columns instead of applying heuristic algorithms
without performance guarantees.